Excel is a general spreadsheet tool, with some statistical capabilityMinitab is a general-purpose statistical package, with an intuitive graphical interfaceR is a general-purpose programming language with an emphasis on statistics and data scienceMinitab and R have modes of operation that enforce good data practice; Excel does notR has an extremely wide range of libraries and packages for specialised and advanced statistical analysis, which are not available in either Minitab or ExcelTo participate in this workshop or to work through the example in your own time, you will need the following on your own computer:
OpenRefine dataset from the “Cleaning Data With OpenRefine” workshop: esoph-tab.tsv (or whatever you named it)Excel (you have a licence for this via the university)Minitab (you have a licence for this via the university)R and RStudioPlease see the configuration notebook for help and guidance on setting up R and RStudio.
To download and install the latest versions of Excel and Minitab please visit the university’s IT pages.
Data can be collected, stored, and analysed in a variety of ways, using a wide range of tools. We do not prescribe any particular approach, other than to note best practices for FAIR and op science prefer open, plain-text, human-readable non-proprietary data formats that are shared in public repositories. The best tool for data analysis for your work may differ depending on your area, and from project to project. Most good tools will allow you to read and write data in exchangeable formats compatible with FAIR and open science. In this workshop we will explore three common tools: Excel, Minitab and R, and use these to open and analyse our cleaned dataset esoph-tab.tsv from the earlier workshop.
ExcelMicrosoft Excel is almost ubiquitous. It is now available free of charge as a cloud service, as part of Office3651. Alternatives to Excel, such as Google Sheets2 and Libre Office3, work in much the same way. This workshop should be adaptable to those alternatives with minimal modification.
ExcelOpenesoph-tab.tsv, select it and click Openknitr::include_graphics("images/excel-01_landing.png")
knitr::include_graphics("images/excel-02_select.png")
For any format other than its own, proprietary .xls or .xlsx format, Excel will start the Text Import Wizard.
This makes loading the simplest datasets seem more complicated than it really is, and serves to nudge users towards Microsoft’s own format, and away from good data practice.
knitr::include_graphics("images/excel-03_import1.png")
The Text Import Wizard should correctly identify that your data are delimited4.
However, Excel may not notice that the first row is a header row5, and may treat all rows equivalently.
Next >knitr::include_graphics("images/excel-04_import2.png")
On the next page of the wizard, Excel indicates that it has identified the tab (\t) character as the column delimiter. You would be able to select a different character if this was incorrect. Here, no changes are necessary.
Next >knitr::include_graphics("images/excel-05_import3.png")
In the next wizard page, Excel allows you to select each column’s data format. These formats are: General, Text, Date, and “skip column”.
Excel does not allow you here to specify data formats or data types useful for statistical or data analysis.
In particular, it does not distinguish between categorical and numerical data. It does not allow you to specify whether numerical values should be counts (whole numbers) or “real” values (can take any decimal value).
Importing data into Excel has very limited data validation.
FinishExcel now presents your data in spreadsheet format.
knitr::include_graphics("images/excel-06_sheet.png")
Here, Excel again attempts to nudge us towards using one of the proprietary Excel formats by claiming that there is **Possible Data Loss:** Some features might be lost if you save this workbook in the comma-delimited (.csv) format. To preserve these features, save it in Excel format.
This is misleading. Our data should “safe” so long as we keep it in plain text, open, non-proprietary formats.
Excel were able to distinguish header rows from data rows?Excel import our data correctly?Excel did not import our data correctly.
Excel interpreted the range 10-19 as a date: October 19th. Our tobgp column now has a number of dates interspersed among the values. This is bad: Excel has changed our data without asking or notifying us that it would do so.
To avoid this problem, we must specify Text as the data format for our columns:
ExcelOpenesoph-tab.tsv, select it and click OpenNext >Next >Textknitr::include_graphics("images/excel-07_text.png")
Finishknitr::include_graphics("images/excel-08_sheet.png")
Excel has now imported our data, and it looks to be correct.
Excel import our data correctly?Google Sheets is a cloud-based spreadsheet package, free for personal use↩︎
Libre Office is a free, open office quite intended as a drop-in, free replacement for Microsoft Office↩︎
delimited: separated by a character that indicates a new column is starting. This is often a tab (\t) or comma (,)↩︎
header row: a row in a table that indicates the contents of each column↩︎